Chapter 11 XML
This chapter shows you how to process the recently released BNC2014, which is by far the largest representative collection of spoken English collected in the UK. For the purpose of our in-class tutorials, I have included a small sample of the BNC2014 in our demo_data. The whole dataset is available via the official website: British National Corpus 2014. Please sign up for complete access if you need the full corpus for your own research.
11.1 BNC Spoken 2014
XML is similar to HTML. Before you process the data, you need to understand the structure of the XML tags in the files. Other than that, the steps are much the same as what we have done before.
First, we read the XML using read_html():
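A minimal sketch, assuming the sample transcripts live in demo_data/corp-bnc-spoken2014-sample/ (the directory used in Section 11.2) and that we simply pick the first file:

```r
library(rvest)

## list all XML transcripts in the sample directory and read the first one
bnc_flist <- dir("demo_data/corp-bnc-spoken2014-sample/", full.names = TRUE)
bnc_xml <- read_html(bnc_flist[1])
```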
Now it is intuitive that our next step is to extract all utterances (with the tag of <u>...</u>) in the XML file. So you may want to do the following:
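With rvest, this first (and, as we will see, problematic) attempt might look like the sketch below, assuming `bnc_xml` is the document object we just read in:

```r
## select all <u> elements and flatten each one to a single string
bnc_xml %>%
  html_nodes("u") %>%
  html_text()
```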
## [1] "\r\nanhourlaterhopeshestaysdownratherlate"
## [2] "\r\nwellshehadthosetwohoursearlier"
## [3] "\r\nyeahIknowbutthat'swhywe'reanhourlateisn'tit?mmI'mtirednow"
## [4] "\r\n"
## [5] "\r\ndidyoutext--ANONnameM"
## [6] "\r\nyeahyeahhewrotebacknobotherlad"
See the problem?
Using the above method, you lose the word boundary information from the corpus.
What if you do the following?
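That is, extract every `<w>` element in the file directly, skipping the utterance level (again a sketch based on the `bnc_xml` object from above):

```r
## select every <w> element in the whole file and get its text
bnc_xml %>%
  html_nodes("w") %>%
  html_text()
```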
## [1] "an" "hour" "later" "hope" "she" "stays" "down"
## [8] "rather" "late" "well" "she" "had" "those" "two"
## [15] "hours" "earlier" "yeah" "I" "know" "but"
At first sight it may seem that we have solved the problem, but we have not. In fact, this second method creates new problems:
- It does not extract non-word tokens within each utterance (e.g., <pause .../>, <vocal .../>)
- It loses the utterance information (i.e., we don’t know which utterance each word belongs to)
So we can extract neither all <u> elements nor all <w> elements at once. Instead, we need to process each <u> node one at a time.
First, let’s get all the <u> nodes.
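Assuming the document object `bnc_xml` from above, we can collect the `<u>` nodes and inspect the first one:

```r
## all utterance nodes in the document
u_nodes <- bnc_xml %>% html_nodes("u")

## print the first utterance node and its child elements
u_nodes[[1]]
```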
## {html_node}
## <u n="1" who="S0024" trans="nonoverlap" whoconfidence="high">
## [1] <w pos="AT1" lemma="a" class="ART" usas="Z5">an</w>
## [2] <w pos="NNT1" lemma="hour" class="SUBST" usas="T1:3">hour</w>
## [3] <w pos="RRR" lemma="later" class="ADV" usas="T4">later</w>
## [4] <pause dur="short"></pause>
## [5] <w pos="VV0" lemma="hope" class="VERB" usas="X2:6">hope</w>
## [6] <w pos="PPHS1" lemma="she" class="PRON" usas="Z8">she</w>
## [7] <w pos="VVZ" lemma="stay" class="VERB" usas="M8">stays</w>
## [8] <w pos="RP" lemma="down" class="ADV" usas="Z5">down</w>
## [9] <pause dur="short"></pause>
## [10] <w pos="RG" lemma="rather" class="ADV" usas="A13:5">rather</w>
## [11] <w pos="JJ" lemma="late" class="ADJ" usas="T4">late</w>
Take the first node in the XML document for example. Each utterance node includes words as well as non-word tokens (i.e., paralinguistic annotations such as <pause ...></pause>). We can retrieve:
- words in an utterance
- lemma forms of all words in the utterance
- pos tags of all words in the utterance (BNC2014 uses UCREL CLAWS6 Tagset)
- paralinguistic tags in the utterance
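For instance, a sketch of these retrievals on the first utterance node (note that iterating over `html_children()` keeps word and non-word tokens aligned: `<pause>` children yield empty strings and `NA` attributes):

```r
u1 <- (bnc_xml %>% html_nodes("u"))[[1]]  # first utterance node

## word strings: <pause> children yield empty strings
u1 %>% html_children() %>% html_text()

## pos tags: <pause> children have no pos attribute, hence NA
u1 %>% html_children() %>% html_attr("pos")

## lemmas, again with NA for non-word tokens
u1 %>% html_children() %>% html_attr("lemma")

## paralinguistic tags, e.g., the duration attribute of the pauses
u1 %>% html_nodes("pause") %>% html_attr("dur")
```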
## [1] "an" "hour" "later" "" "hope" "she" "stays" "down"
## [9] "" "rather" "late"
## [1] "AT1" "NNT1" "RRR" NA "VV0" "PPHS1" "VVZ" "RP" NA
## [10] "RG" "JJ"
## [1] "a" "hour" "later" NA "hope" "she" "stay" "down"
## [9] NA "rather" "late"
Exercise 11.1 Please come up with a way to extract both word and non-word tokens from each utterance. Ideally, the resulting data frame should consist of rows being the utterances, and columns including the attributes of each utterance.
Most importantly, the data frame should record not only the strings of the utterances; for the word tokens, it should also preserve the token-level annotation of word part-of-speech tags (see the utterance column in the table below). A sample utterance-based data frame is provided below.
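For illustration only, one possible shape of such a data frame is sketched below, built from the first utterance we inspected earlier. The column names (`u_id`, `who`, `utterance`) and the `word_POS` format of the utterance column are my assumptions, not a prescribed format:

```r
## a hypothetical one-row example; pause tokens are omitted here for brevity,
## though your solution should record them as well
tibble::tibble(
  u_id      = "1",
  who       = "S0024",
  utterance = "an_AT1 hour_NNT1 later_RRR hope_VV0 she_PPHS1 stays_VVZ down_RP rather_RG late_JJ"
)
```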
11.2 Process the Whole Directory of BNC2014 Sample
11.2.1 Define Function
In Section 11.1, if you have figured out how to extract utterances as well as token-based information from the XML file, you can easily wrap the whole procedure into one function. With this function, we can apply the same procedure to all the XML files of the BNC2014.
For example, let’s assume that we have defined a function:
read_xml_bnc2014 <- function(xml){
...
}
This function takes one XML file as an argument and returns a data frame, consisting of utterances and other relevant token-level information from the XML.
Exercise 11.2 Now your job is to write this function, read_xml_bnc2014(xml = "").
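As a hint, a naive skeleton might look like the sketch below. It is only one possible solution, and it deliberately glosses over details such as how to represent `<pause>`/`<vocal>` tokens (here they would surface as empty strings with `NA` tags):

```r
library(rvest)
library(purrr)
library(tibble)

## a naive sketch of the exercise function -- not the official solution
read_xml_bnc2014 <- function(xml) {
  u_nodes <- xml %>% read_html() %>% html_nodes("u")
  map_dfr(u_nodes, function(u) {
    kids <- html_children(u)
    tibble(
      u_id      = html_attr(u, "n"),
      who       = html_attr(u, "who"),
      utterance = paste(html_text(kids), html_attr(kids, "pos"),
                        sep = "_", collapse = " ")
    )
  })
}
```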
11.2.2 Process All Files in the Directory
Now we utilize the self-defined function, read_xml_bnc2014(), and process all xml files in the demo_data/corp-bnc-spoken2014-sample/. Also, we combine the individual data.frames returned from each xml into a bigger one, i.e., corp_bnc_df:
s.t <- Sys.time()
bnc_flist <- dir("demo_data/corp-bnc-spoken2014-sample/", full.names = T)
corp_bnc_df <- map(bnc_flist,
                   function(x) read_xml_bnc2014(x)) %>% # map `read_xml_bnc2014()` to each xml in the dir
  do.call(rbind, .) # rbind all individual DFs
Sys.time() - s.t
## Time difference of 56.67211 secs
It takes about a minute to process the sample directory. You may want to store this corp_bnc_df data frame for later use so that you don’t have to re-process the XML files every time you work with the BNC2014.
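For example, cache the data frame as a CSV once (the path below matches the commented-out read_csv() line in Section 11.5.1):

```r
## save the processed corpus so the XML parsing only has to run once
readr::write_csv(corp_bnc_df, "demo_data/corp_bnc_df.csv")

## in a later session, simply reload it:
## corp_bnc_df <- readr::read_csv("demo_data/corp_bnc_df.csv")
```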
11.3 Metadata
The best thing about BNC2014 is its rich demographic information relating to the settings and speakers of the conversations collected. The whole corpus comes with two metadata sets:
- bnc2014spoken-textdata.tsv: metadata for each text transcript
- bnc2014spoken-speakerdata.tsv: metadata for each speaker ID
These two metadata sets allow us to get more information about each transcript as well as the speakers in those transcripts.
11.3.1 Text Metadata
There are two files that are relevant to the text metadata:
- bnc2014spoken-textdata.tsv: This file includes the header/metadata information of each text file
- metadata-fields-text.txt: This file includes the column names/meanings of the previous text metadata tsv, i.e., bnc2014spoken-textdata.tsv.
bnc_text_meta <- read_tsv("demo_data/corp-bnc-spoken2014-metadata/bnc2014spoken-textdata.tsv", col_names = FALSE)
bnc_text_meta
bnc_text_meta_names <- read_tsv("demo_data/corp-bnc-spoken2014-metadata/metadata-fields-text.txt", skip = 2, col_names = F)
bnc_text_meta_names
11.3.2 Speaker Metadata
There are two files that are relevant to the speaker metadata:
- bnc2014spoken-speakerdata.tsv: This file includes the demographic information of each speaker
- metadata-fields-speaker.txt: This file includes the column names/meanings of the previous speaker metadata tsv, i.e., bnc2014spoken-speakerdata.tsv.
bnc_sp_meta <- read_tsv("demo_data/corp-bnc-spoken2014-metadata/bnc2014spoken-speakerdata.tsv", col_names = F)
bnc_sp_meta
bnc_sp_meta_names <- read_tsv("demo_data/corp-bnc-spoken2014-metadata/metadata-fields-speaker.txt", skip = 3, col_names = F)
bnc_sp_meta_names
11.5 Word Frequency vs. Gender
Now we are ready to explore the gender differences in language.
11.5.1 Preprocessing
To begin with, there are some utterances with no words at all. We would probably like to remove these empty utterances.
#corp_bnc_df <- read_csv("demo_data/corp_bnc_df.csv")
corp_bnc_df <- corp_bnc_df %>%
filter(!is.na(utterance))
corp_bnc_df
11.5.2 Target Structures
Let’s assume that we would like to know which adjectives are most frequently used by men and women.
corp_bnc_verb_gender <- corp_bnc_df %>%
filter(str_detect(utterance, "_(JJ|JJR|JJT)")) %>% # extract utterances with at least one adj
left_join(bnc_sp_meta, by = c("who"="spid")) # attach SP metadata to the data frame
corp_bnc_verb_gender
11.5.3 Analysis
After we extract utterances with our target structures, we tokenize the utterances and create frequency lists of target structures, i.e., the adjectives.
## Note: we supply our own tokenization function below, because the
## default tokenizer would lower-case and split the word_POS tokens,
## increasing the number of tokens quite a bit
word_by_gender <- corp_bnc_verb_gender %>%
unnest_tokens(word,
utterance,
to_lower = F,
token = function(x) strsplit(x, split = "\\s")) %>% # tokenize utterance into words
filter(str_detect(word, "[^_]+_(JJ|JJR|JJT)")) %>% # include adj only
mutate(word = str_replace(word, "_(JJ|JJR|JJT)", "")) %>% # remove pos tags
count(gender, word)
word_by_gender_top200 <- word_by_gender %>%
group_by(gender) %>%
top_n(200,n) %>%
ungroup
word_by_gender_top200
- Female wordcloud
require(wordcloud2)
word_by_gender_top200 %>%
filter(gender=="F") %>%
select(word, n) %>%
wordcloud2(size = 2, minRotation = -pi/2, maxRotation = -pi/2)
- Male wordcloud
word_by_gender_top200 %>%
filter(gender=="M") %>%
select(word, n) %>%
wordcloud2(size = 2, minRotation = -pi/2, maxRotation = -pi/2)
Exercise 11.3 Which adjectives are more often used by male and female speakers? This is in fact a statistical question, and we can extend our keyword analysis (cf. Chapter 6) to answer it.
Please use the statistics of keyword analysis to find out the top 20 adjectives that are strongly attracted to female and male speakers according to G2 statistics. Please include in the analysis words whose frequencies >= 20 in the corpus.
Also, please note the problem of the NaN values produced by log().
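Concretely, the G2 statistic sums terms of the form observed * log(observed / expected); when an observed frequency is 0, R evaluates 0 * log(0) as 0 * -Inf, which is NaN. A small self-contained sketch (with made-up frequencies) of the conventional guard:

```r
observed <- c(5, 0, 12)
expected <- c(4.2, 1.1, 11.7)

## naive term: the second entry is 0 * -Inf, i.e., NaN
observed * log(observed / expected)

## conventional fix: define 0 * log(0) terms as 0
ll_terms <- ifelse(observed == 0, 0, observed * log(observed / expected))
g2 <- 2 * sum(ll_terms)
g2
```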
11.6 Degree ADV + ADJ
In this example, I would like to look at the adjectives that are emphasized in conversations and examine how these emphasized adjectives may differ across speakers of different genders.
corp_bnc_pattern_gender <- corp_bnc_df %>%
filter(str_detect(utterance, "[^_]+_RG [^_]+_JJ")) %>% # extract utterances with at least one degree ADV + ADJ pair
left_join(bnc_sp_meta, by = c("who"="spid")) # attach SP metadata to the data frame
pattern_by_gender <- corp_bnc_pattern_gender %>%
unnest_tokens(pattern,
utterance,
to_lower = F,
token = function(x) str_extract_all(x, "[^_ ]+_RG [^_ ]+_JJ ")) %>%
mutate(pattern = pattern %>% str_trim %>% str_replace_all("_[^_ ]+","")) %>% # remove pos tags
separate(pattern, into = c("ADV","ADJ"), sep = "\\s") %>%
count(gender, ADJ) %>%
group_by(gender) %>%
top_n(100,n) %>%
ungroup
pattern_by_gender
pattern_by_gender %>%
filter(gender=="F") %>%
select(ADJ, n) %>%
wordcloud2(size = 2, minRotation = -pi/2, maxRotation = -pi/2)
pattern_by_gender %>%
filter(gender=="M") %>%
select(ADJ, n) %>%
wordcloud2(size = 2, minRotation = -pi/2, maxRotation = -pi/2)
Exercise 11.4 Please explore the verbs used with the first person pronoun I in BNC2014 in terms of speakers of different genders. Please create a frequency list of verbs that follow the first person pronoun I in demo_data/corp-bnc-spoken2014-sample. Verbs are defined as any words whose POS tag starts with VV. Please create the word clouds of the top 100 verbs for male and female speakers.

